Skip to content

fix(hub): stop launching system updaters after manager shutdown#1951

Open
BootstrapperSBL wants to merge 1 commit intohenrygd:mainfrom
BootstrapperSBL:fix/system-updater-teardown-race
Open

fix(hub): stop launching system updaters after manager shutdown#1951
BootstrapperSBL wants to merge 1 commit intohenrygd:mainfrom
BootstrapperSBL:fix/system-updater-teardown-race

Conversation

@BootstrapperSBL
Copy link
Copy Markdown
Contributor

Description

Fixes #1950.

The hub test suite intermittently panics on Apple Silicon during go test ./internal/alerts/... with a nil pointer dereference inside PocketBase's RecordQuery. The trace ends in StartUpdater -> setDown -> getRecord -> FindRecordById, with the goroutine created by SystemManager.AddSystem.

Root cause

SystemManager.Initialize spawns a background goroutine that walks the systems loaded from the database and calls AddSystem for each one with a time.Sleep between calls (up to 2s per system). When a test's defer hub.Cleanup() runs while that loop is still sleeping, the loop is not aware of the shutdown:

  1. RemoveAllSystems iterates sm.systems.GetAll() and cancels every system already added.
  2. TestApp.Cleanup calls ResetBootstrapState, nilling the underlying DB.
  3. The staggered loop wakes from its sleep, calls AddSystem for the next system, and starts a StartUpdater goroutine whose sys.manager.hub already has a nil DB. The first update() call inevitably fails (no real agent), setDown -> getRecord -> FindRecordById then panics inside app.ConcurrentDB().

The window is much wider on Apple Silicon because of weaker memory ordering and slower TCP teardown, which is why x86_64 builds usually pass.

Fix

Give SystemManager its own context.Context that is cancelled on shutdown:

  • The staggered Initialize loop replaces time.Sleep(sleepTime) with a select on sm.ctx.Done() and returns early once shutdown is signalled.
  • AddSystem refuses to add a system (and therefore does not spawn StartUpdater) once sm.ctx is cancelled.
  • The test-only RemoveAllSystems helper cancels sm.ctx before tearing systems down, so by the time TestApp.Cleanup clears the DB the staggered loop is guaranteed not to start any new updaters.

No recover() is added; the production paths are unchanged on the happy path.

Before / After

Before (from the issue, also reproducible locally on darwin/arm64):

panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x2 addr=0xb0]
goroutine N [running]:
github.com/pocketbase/pocketbase/core.(*BaseApp).RecordQuery(...)
github.com/henrygd/beszel/internal/hub/systems.(*System).getRecord(...)
github.com/henrygd/beszel/internal/hub/systems.(*System).setDown(...)
github.com/henrygd/beszel/internal/hub/systems.(*System).StartUpdater(...)
created by github.com/henrygd/beszel/internal/hub/systems.(*SystemManager).AddSystem
FAIL    github.com/henrygd/beszel/internal/alerts

After: go test ./internal/alerts/... ./internal/hub/systems/... is green across 5 consecutive runs on darwin/arm64.

Changelog

Fixed

…ygd#1950)

The staggered Initialize goroutine kept calling AddSystem after the hub
had been torn down, which spawned StartUpdater goroutines that later
dereferenced a nil DB inside PocketBase. Track a manager-level context
that RemoveAllSystems cancels, abort the staggered loop on shutdown,
and refuse AddSystem once the manager is stopped.
@BootstrapperSBL
Copy link
Copy Markdown
Contributor Author

Heads up: the VulnCheck failure here is pre-existing on maingolang.org/x/image@v0.38.0 has two advisories (GO-2026-4961 / 4962) that are fixed in 0.39.0. Same failure is blocking #1946 and #1932. Once you bump the dep on main this PR should go green.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Bug]: Panic / Nil pointer dereference in internal/hub/systems/system.go during tests

1 participant